Robust differential expression testing for single-cell CRISPR screens at low multiplicity of infection

Single-cell CRISPR screens (perturb-seq) link genetic perturbations to phenotypic changes in individual cells. The most fundamental task in perturb-seq analysis is to test for association between a perturbation and a count outcome, such as gene expression. We conduct the first-ever comprehensive benchmarking study of association testing methods for low multiplicity-of-infection (MOI) perturb-seq data, finding that existing methods produce excess false positives. We conduct an extensive empirical investigation of the data, identifying three core analysis challenges: sparsity, confounding, and model misspecification. Finally, we develop an association testing method — SCEPTRE low-MOI — that resolves these analysis challenges and demonstrates improved calibration and power. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-024-03254-2.

We conducted a test of association between each NT gRNA and biological replicate.We plotted the resulting p-values on a QQ plot (left); each point in the plot corresponds to a different NT gRNA.Likewise, we carried out a test of association between the expression of each gene and biological replicate (right); each point in the plot similarly corresponds to a different gene.We observed that the two sets of p-values were inflated, indicating that the bulk of the NT gRNAs and the bulk of the genes was associated with biological replicate.Because biological replicate affected both gRNA presence/absence and gene expression, we reasoned that biological replicate acted as a confounder on the Papalexi data.The section Details of the investigation into the core analysis challenges ("Confounding" subsection) describes how the tests of association were conducted.

S2 Additional mathematical details of SCEPTRE (low-MOI)
This section contains additional mathematical details of SCEPTRE (low-MOI).First, we derive the expression for the GLM score statistic used within the method, and next, we describe IWOR sampling.

Derivation of the expression for the GLM score test statistic
We derive the GLM score test statistic for adding a single variable to a fitted GLM model.(We are not aware of any book or paper that presents this derivation.)Let Y ∈ R n be the vector of responses.Let Z ∈ R n×p be the matrix of variables onto which we are regressing Y .(Assume that Z includes a column of ones corresponding to the intercept.)Let X ∈ R n be the vector we seek to test for inclusion in the fitted GLM.Let D = [X Z] ∈ R n×(p+1) be the augmented design matrix.We model the mean µ i of Y i according to where β = [β 1 , . . ., β p ] T ∈ R p and γ are constants, and g : R → R is a link function.Let θ = [β, γ] T ∈ R p+1 denote the vector of unknown parameters.Also, let η i = g −1 (µ i ) denote the linear component of the model, and let V (µ i ) denote the variance of the response y i given mean µ i .
Standard GLM results indicate that the score u and Fisher information I of the model evaluated at θ are where L(θ) is the log-likelihood of the model evaluated at θ, is the matrix of "weights" and is the matrix of derivatives of the linear component with respect to the mean.We can rewrite the score as Next, we can rewrite the Fisher information as Define I X by I X = I 22 − I 21 I −1 11 I 12 .We have that Let β be the MLE for β.Next, let u X denote the component of the score vector corresponding to γ, i.e., u X (θ) = X T W M (Y − µ).
The value of u X evaluated at ( β, 0) is where Ŵ , M , and μ denote W, M , and µ, respectively, evaluated at ( β, 0).Similarly, the value of I X (θ) evaluated at ( β, 0) is Standard asymptotic results indicate that The GLM score test z-score is therefore The GLM score test statistic typically is expressed in a different way in books and papers.[1]Let E 2 denote the matrix of residuals after least squares regression of the columns of X onto Z, i.e., Furthermore, define r = M (y − μ) as the "working residual" vector.The GLM score test statistic typically is expressed as The expressions (1) and (2) are identical: Expression (2) is evaluated via QR decomposition in practice.By contrast, we propose to evaluate (1) via spectral or Cholesky decomposition, thereby exploiting the sparsity of X.

Inductive without replacement sampling
Permutation testing is a computationally expensive statistical technique.There are two main steps involved permutation testing: first, permuting the treatment vector B times, where B is some large integer (e.g., B = 5, 000), and second, recomputing the test statistic on the observed dataset and the B permuted datasets.Both of these steps are slow; which is slower depends on several factors, including the test statistic in use and the sparsity of the treatment vector.Permutation testing is even more expensive in the "high-throughput setting" in which one seeks to test tens of thousands (or more) of hypotheses via a permutation test.We introduce "inductive without replacement" (IWOR) sampling, a technique for recycling compute across permutation tests in which the number of "control units" is constant across tests (as in low-MOI single-cell CRISPR screens).IWOR sampling is a (provably) approximation-free technique, yielding an exact and correct permutation p-value for every hypothesis tested.

A concrete example of IWOR sampling
We first provide a concrete example of IWOR sampling to help convey intuition for the procedure.(We describe IWOR sampling in a mathematically precise way in the following section.)Suppose that we are analyzing a single-cell CRISPR screen dataset containing 19 cells, four of which contain a non-targeting perturbation, four of which contain targeting perturbation A, five of which contain targeting perturbation B, and six of which contain targeting perturbation C. Suppose that we seek to test for association between each of the targeting perturbations and some gene.Call pair A (resp., pair B, pair C ) the pair that results from coupling perturbation A (resp., perturbation B, perturbation C ) to the gene.
Figure S10 depicts the "observed treatment vector" corresponding to pair A, pair B, and pair C. For example, the observed treatment vector for pair A contains four treatment cells (blue circles) and four control cells (gray circles), while the observed treatment vector for pair B contains five treatment cells and four control cells, etc.We can enumerate the cells contained within each pair by assigning an integer index to each cell.For example, we can enumerate the cells contained within pair A by 1, . . ., 8, and we can enumerate the cells contained within pair B by 1, . . ., 9, etc. (Note: although we reuse integer indices across pairs, the cells themselves are not shared across pairs.)We seek to permute the observed treatment vector corresponding to each pair B times.This is equivalent to drawing B without replacement samples of size k from the set of indices 1, . . ., k + N , where k is the number of treatment cells in a given pair and N is the number of control cells in the dataset.For example, consider the observed treatment vector corresponding to pair A (Figure S10, left column).We can permute this vector by assigning cells 2, 4, 5, and 6 to "treatment" status and assigning the remaining cells to "control" status.Moreover, we can identify this particular permutation of the vector with the set of integers {2, 4, 5, 6}, where {2, 4, 5, 6} can be thought of as a without-replacement sample from the set of indices 1, . . ., 8. Here, 8 is the sum of the number of control cells (4) and the number of treatment cells (4).
Our goal is to randomly permute the observed treatment vector corresponding to each pair.IWOR sampling is a strategy for doing this efficiently.An example IWOR sample for this dataset is v := [5, 6, 4, 2, 9, 10] (Figure S10, top).The set that contains the first k elements of v constitutes a valid without-replacement sample of size k from the set of indices 1, . . ., k + 4. (The significance of 4 in this context is that 4 is the number of control cells in the example dataset.)For example, the set constructed from the first four elements of v -{5, 6, 4, 2} -is a without-replacement sample from the set 1, . . ., 8. Thus, {5, 6, 4, 2} is a valid permutation of the treatment vector corresponding An "IWOR sample" (top) is a single vector from which the set of permutation indices for each hypothesis can be extracted.The permutation indices for pair A consist of the first four elements of the IWOR sample; the permutation indices for pair B consist of the first five elements of the IWOR sample; and the permutation indices for pair C consist of the first six elements of the IWOR sample.Importantly, the cells themselves are not shared across hypotheses; rather, the IWOR sample is shared across hypotheses.IWOR sampling assumes that the number of control units is fixed across hypotheses; this assumption holds on low-MOI single-cell CRISPR screen data.
to pair A, which contains four treatment cells.
More generally, suppose a given dataset contains N control cells.Let x be a random (i.e., unrealized) IWOR sample, and let x[1 : k] denote the first k elements of x.The IWOR sample is carefully constructed such that it satisfies the following crucial property: In other words, the first k elements of x constitute a valid WOR sample from the set {1, . . ., k + N }.The permutation defined by x[1 : k] is therefore a valid permutation for a pair containing k treatment cells and N control cells.

Definition of IWOR sampling
We provide a more precise definition of IWOR sampling here.Consider a gene expression vector Y ∈ R n , technical factor matrix Z ∈ R n×p , and binary "treatment" vector of perturbation indicators X ∈ {0, 1} n .We can identify a given treatment vector with the set of nonzero entries within that vector.In other words, we can represent the vector X as the set S = {i : X[i] = 1}, where X[i] denotes the value of the vector X at position i.Suppose that X has m nonzero entries.Randomly permuting X is equivalent to sampling m elements without replacement (WOR) from the set {1, . . ., n}.Unfortunately, drawing B WOR samples for every perturbation-gene pairwhich is required for carrying out the permutation test dataset-wide -is slow.IWOR sampling shares WOR samples across pairs.
A key observation is that the number of control cells is fixed across all hypotheses, while the number of treatment cells -i.e., the number of cells containing the targeting perturbationvaries from hypothesis to hypothesis (depending on the targeting perturbation).We exploit this structure of the problem as follows.Let N be the number of control cells; label the control cells by c 1 , c 2 , . . ., c N .Assume that there are p targeting perturbations.For perturbation i ∈ {1, . . ., p}, let k i be the number of cells containing perturbation i.Finally, let M = max i∈{1,...,p} k i be the number of cells containing the perturbation that infects the greatest number of cells.Label the treatment cells by t 1 , . . ., t M .We construct a length-M random sequence a 1 , a 2 , . . ., a M that satisfies the following three properties: 1. a i ∈ {c 1 , . . ., c N , t 1 , . . ., t i } for all i ∈ {1, . . ., M }; that is, the ith element of the sequence is a control cell or one of the first i treatment cells.
2. a i ̸ = a j for i ̸ = j.That is, the elements of the sequence are unique.
Let A i := {a 1 , . . ., a i } denote the set containing the first i elements of the sequence.(Observe that there are N + i elements that possibly could be in A i : the N control cells and the first i treatment cells.)We additionally require the following property to hold: That is, each of the elements that possibly could be in A i has an equal chance of being in A i .
The sequence a 1 , . . ., a M yields a sequence of increasing sets The first set A 1 , which contains a single element, contains the control cells and the first treatment cell with equal probability.The second set A 2 , which contains two elements, contains the control cells and each of the first two treatment cells with equal probability, and so on.Thus, if a given geneperturbation pair contains k ≤ M treatment cells, then A k constitutes a valid WOR sample for that pair.Importantly, because the sequence of sets A 1 , . . ., A M is increasing, we can share the random sample a 1 , . . ., a M across all gene-perturbation pairs.We call this technique "inductive without replacement sampling" (or "IWOR sampling").The SCEPTRE (low MOI) algorithm generates a set of B length-M IWOR samples and shares this set of samples across all pairs.

An abstract approach for constructing an IWOR sample
We describe an abstract approach for constructing an IWOR sample and prove its correctness.The procedure is inductive.
Step 1. Sample one element from the set {c 1 , . . ., c N , t 1 }, putting an equal mass of 1/(N + 1) onto each of the elements.Call the sampled element a 1 .Set A 1 = {a 1 }.
Step i for i ∈ {2, . . ., M }.Let B i := {c 1 , . . ., c N , t 1 . . ., t i−1 } \ A i−1 denote the set of "leftover" elements not sampled in steps 1, . . ., i − 1.There are N elements in the set B i .Draw an element at random from the set B i {t i }, placing a mass of i i+N on t i and a mass of 1 The resulting sequence a 1 , . . ., a M satisfies the three properties listed above, which we now prove.
Proof : It is clear that a i ∈ {c 1 , . . ., c N , t 1 , . . ., t i } for all i and that a i ̸ = a j for i ̸ = j.We focus on proving the third property, which we do inductively.
Base case: Let i = 1.Then Inductive step: Suppose that To construct the set A i+1 , we sample t i+1 with probability (i + 1)/(N + i + 1) and each element of We call the sampled element a i+1 , and we set A i+1 = A i ∪ {a i+1 }.Our goal is to show that, for all u ∈ {c 1 , . . ., c N , t 1 , . . ., t i+1 }, (That is, for all elements that could be in A i+1 , each has an equal chance of actually being in A i+1 .) The equality (3) holds for u = t i+1 by construction.Next, suppose that u ∈ {c 1 , . . ., c N , t 1 , . . ., t i }.
We compute P(u ∈ A i+1 ) using the law of total probability: We consider first the term P(u ∈ A i+1 |u ∈ A i )P(u ∈ A i ).We have that P(u ∈ A i+1 |u ∈ A i ) = 1, as membership of u in A i implies membership of u in A i+1 .By the inductive hypothesis, P(u Next, we consider the term P(u Combining (4), (5), and (6),

□
A concrete algorithm for IWOR sampling We now derive a concrete algorithm for constructing an IWOR sample of size M on the set {c 1 , . . ., c N , t 1 , . . ., t M }.The algorithm requires only a high-quality uniform random number generator, which is readily available in most programming languages.First, we describe a simple algorithm for sampling from the discrete probability distribution with support {0, . . ., N } that places mass i/(i + N ) on N and mass (1/N )[1 − i/(i + N )] on {0, 1, ..., N − 1}.(We call this distribution the IWOR(N, i) distribution.)The algorithm, given in Algorithm 1, takes O(1) time and space.Next, we provide an algorithm for constructing an IWOR sample of length M on the set of control cells {c 1 , . . ., c N } and treatment cells {t 1 , . . ., t M }.Algorithm 2 is a concrete instantiation of the "abstract approach for constructing an IWOR sample," as described in the previous section.The vector r is a vector of length N + 1.At step i, the N cells that remain (i.e., those that have not yet been sampled among {c 1 , ..., c N , t 1 , . . ., t i−1 }) are stored in the leftmost N positions of r.The cell t i is then inserted into the rightmost position of r.A cell is sampled from r via the IWOR(N, i) distribution, which places more mass on the cell in the rightmost position (i.e., t i ).The sampled cell is moved into v.Then, the cell in the rightmost entry of r (i.e., t i ) is moved into the position vacated by the sampled cell.Finally, the cell t i+1 is moved into the rightmost position of r.We Algorithm 1: Generating a sample from the IWOR(N, i) distribution.
Input N and i u ∼ U (0, 1) SCEPTRE (low-MOI) uses a slightly modified, more efficient, and more numerically stable version of Algorithm 2 for constructing the IWOR sample.Suppose there are p targeting perturbations, with the ith perturbation infecting k i cells.Let M = max i∈{1,...,p} k i be the number of cells containing the perturbation that infects the greatest number of cells, and let m = min i∈{1,...,p} k i be the number of cells containing the perturbation that infects the least number of cells.SCEPTRE first samples k cells without replacement from the set c 1 , . . ., c N , t 1 , . . ., t k .Then, SCEPTRE constructs an IWOR sample, taking as input the cells that remain from the first step and treatment cells t k+1 , t k+2 , . . ., t M .

S3 Additional empirical analyses
In this section we present several additional empirical analyses.
Comparing the score statistic to the difference-in-residual-means statistic SCEPTRE (low-MOI) uses a permutation test based on an NB GLM score statistic.A natural question is whether SCEPTRE (low-MOI) is better (in some sense) than the following alternative method: first, "regress out" the covariates by regressing the response vector onto the covariate matrix via an NB GLM; second, extract the residuals1 from the fitted GLM; third, perform a permutation test between the treatment vector (i.e., the vector of perturbation presences/absences) and the vector of residuals, taking as the test statistic the standard difference in means statistic.Using mathematical language, let Y = [Y 1 , . . ., Y n ] T be the vector of gene (or protein, etc.) UMI counts, X = [X 1 , . . ., X n ] T the vector of perturbation presence or absence indicators, and Z ∈ R n×p the matrix of technical factors (or covariates).Regressing Y onto Z via an NB GLM yields a vector of fitted means [μ 1 , . . ., μn ].Define the ith (raw) residual by r i = y i − μi .The difference-in-residualmeans test statistic T resid is defined as where s = n i=1 X i is the number of "treated" cells (i.e., the number of cells containing a perturbation).The GLM score statistic, meanwhile, is defined as where W and M (Y − μ) are given by respectively.Clearly, T score is more complex than T resid : evaluating the denominator of the former involves performing matrix multiplication, addition, transposition, and inversion operations.How (if at all) do T score and T resid differ with respect to the p-values that they produce?To explore this question, we applied a permutation test based on T score and another based on T resid to real and simulated data, comparing the two methods with respect to type-I error control, power, and computational performance.We found that T score and T resid exhibited similar type-I error control and computational performance but that T score could be much more powerful than T resid .We note that these results -while interesting -are preliminary; we intend to carry out a more in-depth comparison between T score and T resid in a followup, more statistically-focused work.
Simulation study.We first compared T resid to T score on simulated data.We used n = 5, 000 cells in the simulation study.We constructed the cell-specific covariates z 1 , . . ., z n ∈ R 3 by drawing n i.i.d.samples from a bivariate normal distribution and appending a one for the intercept term.
(The cell-specific covariates were meant to represent technical factors, such as sequencing depth and percent mitochondrial reads.)Next, we generated the ith binary treatment indicator x i by sampling x i from a Bernoulli distribution with mean µ i = σ −1 (τ T z i ), where σ was the logit function and τ ∈ R 3 a fixed constant.Thus, the treatment x i and covariate z i were correlated, putting us into the confounded (as opposed to unconfounded ) problem setting.
We then simulated gene expression data on n genes = 500 genes.For a given gene j ∈ {1, . . ., n genes }, we sampled UMI counts y j 1 , . . ., y j n from an NB GLM model: where β T ∈ R p was a fixed regression coefficient, θ j > 0 was a gene-specific NB size parameter, and γ j ∈ R was a gene-specific treatment effect.To simulate variation in overdispersion across genes, we sampled θ 1 , . . ., θ nsim from a Unif(0.1, 5) distribution.Additionally, to simulate variation in response to the treatment, we set the gene-specific treatment effect γ j to 0 for 450 (randomlyselected) genes, and we set γ j to a small, positive constant for the remaining 50 genes.Thus, for 450 (or 90% of) genes, we simulated data from the null model (in which treatment and response were not associated), and for 50 (or 10% of) genes, we simulated data from an alternative model (in which treatment and response were associated).
We applied three methods to analyze the data: a permutation test based on the differencein-residual-means statistic T resid , a permutation test based on the score test statistic T score , and a GLM likelihood ratio test.We employed the adaptive permutation testing scheme and skewnormal approximation of SCEPTRE (low-MOI) to accelerate the two permutation-based methods (see Acceleration 3: Adaptive permutation testing and Acceleration 4: Skew-normal fit in the Methods section for details).Additionally, to simplify comparison of the methods, we treated the NB size parameter θ j as fixed and known.(We estimated the NB size parameter in the real data analysis; see next section.)The GLM likelihood ratio test consisted of comparing the full model ( 9) to the reduced model that results from setting γ j to zero in (9).We note that the GLM likelihood ratio test was optimal in the simulation study setting (i.e., the setting in which the GLM is correctly specified and the sample size is sufficiently large so as to enable accurate asymptotic inference).The T resid -based permutation test and T score -based permutation test were applied with two-sided test statistics so as to enable a meaningful comparison with the GLM likelihood ratio test, which is an inherently two-tailed test.
We assessed the degree of concordance between the T score -based permutation test p-values, the T resid -based permutation test p-values, and the GLM likelihood ratio test (LRT) p-values.As the GLM LRT p-values were optimal, we looked for agreement between the GLM LRT p-values and the p-values outputted by the two permutation methods.First, we plotted the GLM LRT p-values against the T resid -based permutation test p-values (Figure S11a).Each point in the plot represents one of the 500 hypotheses tested; the red (resp., blue) points correspond to the null (resp., alternative) hypotheses.The vertical (resp., horizontal) position of a given point indicates the GLM LRT (resp., T resid -based permutation test) p-value computed on that hypothesis.The p-values were negative log-transformed to facilitate visualization of small p-values.We observed that the T residbased p-values were strongly and positively correlated with the GLM LRT p-values.However, the T resid -based p-values were generally larger (i.e., less significant) than the GLM LRT p-values.We regressed the GLM LRT p-values onto the T resid -based p-values (via OLS) and superimposed the line of best fit onto the plot (dashed orange line).The line of best fit clearly lay above the identity line (solid black line), indicating deflation of the T resid -based p-values relative to the GLM LRT p-values.
Next, we plotted the GLM LRT p-values against the T score -based permutation p-values (Figure S11b).The two sets of p-values were highly concordant, as evidenced by the tight clustering  of the points around the identity line (solid black line).Furthermore, the line of best fit (solid orange line) closely tracked the identity line, indicating a high degree of agreement between the two sets of p-values.Finally, we plotted the T score -based permutation p-values against the T residbased permutation p-values (Figure S11c).As expected, the T score -based p-values were generally smaller (i.e., more significant) than the T resid -based p-values.We concluded that the T score -based permutation p-values demonstrated a higher degree of concordance with the GLM LRT p-values than did the T resid -based permutation p-values.Finally, we subjected the three sets of p-values (i.e., those produced by the T score -based permutation test, the T resid -based permutation test, and the GLM LRT) to a BH correction at level 10%, obtaining a discovery set for method.The GLM LRT yielded 41 discoveries, two of which were false discoveries (false discovery proportion = 4.9%); next, the T score -based permutation test produced 42 discoveries, two of which were false discoveries (false discovery proportion = 4.7%); finally, the T resid -based permutation test yielded 25 discoveries and zero false discoveries (false discovery proportion = 0%).All three methods controlled the false discovery proportion at the nominal level of 10%.However, the T score -based permutation test and GLM LRT yielded about 68% more discoveries than the T resid -based permutation test.
To ensure robustness of the above results to choice of the random seed, we repeated the above experiment 500 times (Figure S11d).Averaged across runs, the GLM LRT, the T score -based permutation test, and the T resid -based permutation test yielded a mean false discovery proportion of 9.0%, 9.0%, and 2.6%, respectively.(The mean false discovery proportion functioned as an estimate for the false discovery rate.)Thus, all methods controlled the false discovery rate at the nominal level of 10%.2 Next, the GLM LRT, the T score -based permutation test, and the T resid -based permutation test yielded a mean of 36.6 discoveries, 36.3 discoveries, and 23.9 discoveries, respectively.Hence, the GLM LRT and T score -based permutation test produced about 53% more discoveries than the T resid -based permutation test.Finally, the GLM LRT, the T score -based permutation test, and the T resid -based permutation test had a mean per-hypothesis compute time of 0.021 s, 0.022 s, and 0.012 s, respectively.Thus, the GLM LRT and T score -based permutation test demonstrated similar running times, while the T resid -based permutation test was roughly twice as fast.
In summary the GLM LRT and the T score -based permutation test exhibited nearly identical performance in the simulation study: the two methods controlled type-I error at the same rate, made virtually the same number of discoveries, and had essentially the same running time.The T resid -based permutation test, on the other hand, was considerably less powerful than either the GLM LRT or the T score -based permutation test.We highlight that -in principle -the T scorebased permutation test can retain validity in a much broader range of settings than the standard GLM LRT, such as settings in which the data are sparse or the NB model is misspecified.We conjecture that the T score -based permutation test trades a small loss in power for a substantial improvement in robustness relative to a standard GLM LRT.Additional work remains to be done to understand this phenomenon.
Real data analysis.We applied both the T score -based permutation test and the T resid -based permutation test to the Frangieh IFN-γ data.Similar to the simulation study, both methods controlled type-I error, while the T score -based test produced a larger discovery set than the T resid -based test.We analyzed three sets of perturbation-gene pairs: negative control pairs, positive control pairs, and discovery pairs.The discovery pairs consisted of the entire set (i.e., the trans set) of perturbation-gene pairs (that passed pairwise QC).The negative control pairs were constructed by the SCEPTRE software and were "matched" to the discovery pairs. 3In particular, each negative control target consisted of three, randomly-grouped non-targeting gRNAs.(In this sense the calibration check analysis reported here is not precisely the same as the calibration check analysis reported in Figure 4.) Finally, the positive control pairs consisted of perturbations targeting gene transcription start sites paired to the genes regulated by those transcription start sites.
Initially, the p-values outputted by the T score -based method and the T resid -based method exhibited mild miscalibration on the negative control data: the former method yielded five false discoveries, and the latter method yielded three false discoveries (at BH level 0.1).Therefore, we slightly increased the stringency of the cellwise QC thresholds, removing cells whose percent mitochondrial value exceeded 0.1 or whose number of genes expressed or total gene UMI count fell outside of the [0.1, 0.99] percentile range.We plotted the resulting T score -based p-values against the T resid -based p-values computed on the negative control pairs (S12a), positive control pairs (S12b), and discovery pairs (S12c).We subjected both the T score -based p-values and the T resid -based p-values to a BH correction (separating out the negative control pairs, positive control pairs, and discovery pairs).We varied the nominal FDR level over the grid {0.1, 0.05, 0.01, 0.005} to explore how increasing the stringency of the multiplicity correction might affect the results.The T score -based negative control p-values and T resid -based negative control p-values were roughly symmetrically distributed about the identity line (S12a).Moreover, both the T score -based test and T resid -based test controlled type-I error on the negative control pairs, as each method produced zero or one false discoveries at the various BH levels (S12d).
Next, we observed that the positive control T score -based p-values and T resid -based p-values were highly correlated (S12b).However, the T score -based p-values were generally smaller (i.e., more significant than) their T resid -based counterparts, hinting at the greater power of the T score test statistic.Finally, the discovery T score -based p-values and T resid -based p-values also were highly correlated, and again, the T score -based p-values were generally smaller than their T resid counterparts, especially in the tail of the p-value distribution.Moreover, the T score -based method yielded a larger discovery set than the T resid -based method over the various nominal FDR levels.The relative difference between the size of the of the T score -based discovery set and T resid -based discovery set increased as the BH correction became more stringent.For example, at BH level 0.1, the T scorebased method yielded 274 discoveries, a 3% improvement over the T resid -based method, which made 266 discoveries.However, at BH level 0.005, the T score -based method produced 112 discoveries, a 29% improvement over the T resid -baesd method, which made 87 discoveries (Figure S12d).
We repeated this analysis on the Papalexi gene expression data, obtaining broadly similar results: the T score -based and T resid -based methods both controlled type-I error on the negative control data, but the T score -based method produced slightly smaller p-values on the positive control pairs, and the T score -based method yielded a larger discovery set (by about 3%) on the discovery pairs.Overall, synthesizing the results of the simulation study, the Frangieh data analysis, and the Papalexi data analysis, we concluded that the T score statistic and T resid statistic can yield quite different results.In particular, depending on the dataset under analysis and the stringency of the multiple testing adjustment, the T score statistic can produce a discovery set up to 50% larger than the one produced by the T resid statistic.

Comparing the spectral decomposition algorithm to the QR decomposition algorithm for computing GLM score tests
We proposed a spectral decomposition algorithm for computing GLM score tests within the context of SCEPTRE (see Acceleration 2: A fast score test for binary treatments).The spectral decomposition approach differs from the standard approach for computing GLM score tests; the latter is based on a QR decomposition.We conducted a small computational experiment to compare  the speed of the spectral decomposition algorithm to that of the QR decomposition algorithm.We found that the former was faster for sparse, binary treatment vectors.Furthermore, as we increased the sparsity level of the treatment vector, we found that the speed of the spectral decomposition algorithm relative to that of the QR decomposition algorithm increased.We generated synthetic data on n = 100, 000 cells, drew the covariates z 1 , . . ., z n from a bivariate Gaussian distribution, and generated the gene expression counts from a negative binomial model: We regressed the gene expression counts onto the covariates, obtaining a fitted GLM fit.matrix equal to zero.
We can model the effective sample size X of a given pair using a binomial model: Among the n trt cells that contain the perturbation, each exhibits nonzero expression of the gene with probability 1 − s.Thus, the number of cells containing both the perturbation and nonzero expression of the gene -i.e., the effective sample size -is binomially distributed with parameters n trt and 1 − s.The probability that X equals x ∈ {0, . . ., n trt } is given by the binomial pmf: The pair passes pairwise QC only if X ≥ T, i.e., the effective sample size is greater than or equal to the pairwise QC threshold.Let π be the probability that the pair passes pairwise QC. (One should think of π as a parameter specified by the experimenter, like 0.9.)Finally, let n trt (π, T, s) be the number of treatment cells required for the pair to pass pairwise QC with probability π at pairwise QC threshold T on a dataset of sparsity s.We can express n trt (π, T, s) mathematically as n trt (π, T, s) = min n ∈ {1, 2, 3, . . .} : In other words n trt (π, T, s) is the smallest integer n such that the right-tail probability of Bin(n, 1−s) evaluated T is greater than or equal to π.It is challenging to derive an exact expression for n trt (π, T, s), but we can evaluate n trt (π, T, s) numerically.In summary, suppose that the sparsity level of the dataset is s, that the pairwise QC threshold is set to T , and that the desired probability that a given pair passes pairwise QC is π.Then the experimenter should ensure that each perturbation is contained within at least n trt (π, T, s) cells.We evaluated n trt (π, T, s) for different values of π, T , and s, plotting n trt as a function of π (Figure S13).We set the sparsity level s by computing the sparsity of the datasets analyzed in this work.The four "standard" datasets with gene expression readout -i.e., Frangieh co-culture, Frangieh control, Frangieh IFN-γ, and Papalexi (gene modality) -exhibited sparsity levels of 0.76, 0.77, 0.80, and 0.77, respectively (mean sparsity level 0.78).Next, the two TAP-seq datasetsi.e., Schraivogel chromosome 11 and Schraivogel chromosome 8 -exhibited sparsity levels of 0.33 and 0.27, respectively (mean sparsity level 0.30).Finally, the Papalexi (protein modality) dataset exhibited a sparsity level of 0.0.Thus, we set the sparsity s to 0.77, 0.30, and 0.0, corresponding to a "standard" screen with gene expression readout, a TAP-seq screen with gene expression readout, and a screen with protein expression readout.
We plotted the number of required treatment cells n trt (π, T, s) as a function of the probability π that a given pair passes pairwise QC (Figure S13).The left (resp., right) panel shows n trt (π, T, s) for an effective sample size QC threshold T of 7 (resp., 14).(The default threshold that the SCEPTRE software uses is 7.) Finally, the colors within each plot indicate the different sparsity levels: 0.78 (red), 0.30 (purple), and 0.0 (blue).The visual Figure S13 can be used to help guide experimental design.For example, suppose that an experimenter is designing a screen in which she plans to sequence gene transcripts using the standard single-cell sequencing protocol.(That is, the experimenter is not using the TAP-seq protocol.)Based on prior single-cell CRISPR screens datasets, the experimenter expects the sparsity of the dataset to be about 0.78.Moreover, suppose that the experimenter intends to use the default SCEPTRE pairwise QC threshold of T = 7.Finally, suppose that experimenter aims for 95% of the pairs to pass pairwise QC.Then the experimenter should try to ensure that each perturbation is contained within at least n trt (0.95, 7, 0.95) = 50 cells.As a very rough, general guideline, experimenters probably should ensure that each perturbation is contained within at least 50-80 cells.We (i.e., the authors) intend to add functionality to the SCEPTRE software in the future to aid users in designing single-cell CRISPR screen experiments, including a module for determining the approximate number of cells required per perturbation.
We note that the simple model considered here does not account for more complex phenomena present in single-cell CRISPR screen data.For example, cells containing multiple perturbations typically are removed as part of cellwise QC in low MOI, thereby decreasing the sample size.Moreover, sparsity can vary considerably across genes, as some genes are more highly expressed or more deeply sequenced than others.We believe that the analysis above constitutes a solid starting point for reasoning about experimental design in the context of SCEPTRE; however, additional further investigation remains to be done in a followup work.

Figure S1 :
Figure S1: Calibration results for all methods on Frangieh IFN-γ, Frangieh control, and Frangieh co-culture negative control data.Left, untransformed QQ plots; middle, negative log-10 transformed QQ plots; right; number of false rejections after a Bonferroni correction at level 0.1.

Figure S2 :
Figure S2: Calibration results for all methods on Schraivogel, Papalexi (gene modality), and Papalexi (protein modality) negative control data.Interpretation is the same as in Figure S1.

Figure S3 :
Figure S3: Calibration results for all methods on simulated data.Interpretation is the same as in Figure S1.

Figure S4 :
Figure S4: Calibration results for all methods on Frangieh (IFN-γ) and Papalexi (gene modality) datasets, stratified by effective sample size.Negative control gene-gRNA pairs are partitioned into four bins of approximately equal size based on the number of treatment cells with nonzero expression in a given pair.The interval on the right-hand side of each panel indicates the minimum and maximum number of treatment cells with nonzero gene expression for pairs in that bin.Some methods (e.g., Seurat-Wilcox on the Frangieh IFN-γ data) exhibit better calibration as the number of treatment cells with nonzero expression increases (i.e., as sparsity decreases).

Figure S5 :
Figure S5: Calibration results for all methods on Frangieh (control) and simulated datasets, stratified by effective sample size.This figure is identical to Figure S4 but displays results on the Frangieh (control) and simulated datasets.

Figure S6 :
Figure S6: Confounding due to biological replicate on the Papalexi (gene modality) data.We conducted a test of association between each NT gRNA and biological replicate.We plotted the resulting p-values on a QQ plot (left); each point in the plot corresponds to a different NT gRNA.Likewise, we carried out a test of association between the expression of each gene and biological replicate (right); each point in the plot similarly corresponds to a different gene.We observed that the two sets of p-values were inflated, indicating that the bulk of the NT gRNAs and the bulk of the genes was associated with biological replicate.Because biological replicate affected both gRNA presence/absence and gene expression, we reasoned that biological replicate acted as a confounder on the Papalexi data.The section Details of the investigation into the core analysis challenges ("Confounding" subsection) describes how the tests of association were conducted.

Figure S7 :
Figure S7: Comparison of methods for enrichment of ChIP-seq signal on the Papalexi data.Number of discoveries (after a BH correction at level 0.1), ChIP-seq enrichment odds ratio, and ChIP-seq enrichment p-value of each method for the transcription factors IRF1 and STAT1.The p-value quantifies the extent to which the discovery set of a given method exhibited enrichment for cell-type-relevant ChIP-seq signal.See section ChIP-seq enrichment analysis for a description of how the ChIP-seq data were used to identify the target genes of STAT1 and IRF1.

Figure S8 :
Figure S8: Demonstration of the CAMP ("confounder adjustment via marginal permutations") phenomenon on realistic semi-synthetic data.Application of a standard permutation test, NB regression, and SCEPTRE to realistic semi-synthetic data generated under two conditions: confounded and unconfounded.Panels a and b (resp., c and d) show the results on the unconfounded (resp., confounded) data; meanwhile, panels a and c (resp., b and d) show the results under correct (resp.incorrect) specification of the negative binomial size parameter.The permutation test (gray) works well when the data are unconfounded (panels a and b) but breaks down in the presence of confounding (panelsc and d).On the other hand, NB regression is wellcalibrated when the size parameter is correctly specified (panels a and c) but fails when the size parameter is misspecified (panels b and d).SCEPTRE is well-calibrated in all settings.We note that SCEPTRE is expected to break down when the (i) problem is confounded and the NB regression model is arbitrarily misspecified or (ii) the problem is confounded and the sparsity is high.Details of the simulation study are given in Section CAMP simulation study details.

Figure S9 :
Figure S9: SCEPTRE's power to detect associations increases as effective sample size increases.a-c, SCEPTRE p-value (truncated at 10 −6 ) versus effective sample size for each pair on the Frangieh co-culture (a), control (b), and IFN-γ (c) positive control data.The horizontal dashed line is drawn at 10 −5 , demarcating a highly significant discovery.SCEPTRE makes only one rejection at a highly significant level on pairs for which the effective sample size less than seven (blue region).

Figure S10 :
Figure S10: Inductive without replacement (IWOR) sampling is an approximation-free technique for sharing permutation indices a large number of permutation tests.An "IWOR sample" (top) is a single vector from which the set of permutation indices for each hypothesis can be extracted.The permutation indices for pair A consist of the first four elements of the IWOR sample; the permutation indices for pair B consist of the first five elements of the IWOR sample; and the permutation indices for pair C consist of the first six elements of the IWOR sample.Importantly, the cells themselves are not shared across hypotheses; rather, the IWOR sample is shared across hypotheses.IWOR sampling assumes that the number of control units is fixed across hypotheses; this assumption holds on low-MOI single-cell CRISPR screen data.
floor operator end return d repeat this process until we have iterated through all of the treatment cells.This algorithm is fast and efficient, taking O(M ) time and O(M + N ) space.Algorithm 2: Generating an IWOR sample of size M given control cells c 1 , . . ., c N and treatment cells t 1 , . . ., t M .Input Control cells c 1 , . . ., c N and treatment cells t 1 , . . ., t M .Initialize an empty vector v of size M .Initialize the vector r ← [c 1 , c 2 , . . ., c N , t 1 ]. for i = 1 . . .M do pos ∼ IWOR(N, i) // sample a position within r v[i − 1] ← r[pos] // move the element at that position into v r[pos] ← r[N ] // move the rightmost entry of r to position pos r[N ] ← t i+1 // update the rightmost entry of r with t i+1 end return v

Figure S11 :
Figure S11: A comparison of the T score and T resid permutation test statistics on data simulated from an NB GLM.a, GLM likelihood ratio test (LRT) p-values vs. T resid -based permutation test p-values.The T resid -based p-values were generally larger (i.e., more conservative) than the GLM LRT p-values.b, GLM LRT p-values vs. T score -based permutation test p-values.These two sets of p-values were highly concordant.c, T score -based permutation test p-values vs. T resid -based permutation test p-values.The T resid -based permutation test p-values were generally more conservative than the T score -based permutation test p-values.d, mean false discovery proportion, number of discoveries, and running time of the GLM LRT, T resid -based permutation test, and T score -based permutation test across 500 replicates of the experiment.The GLM LRT and T score -based permutation test exhibited near-identical statistical and computational performance.The T resid -based permutation test exhibited lower power than the former two methods.

Figure S12 :
Figure S12: A comparison of the T score and T resid permutation test statistics on the Frangieh IFN-γ data.T score -based p-values vs. T resid -based p-values on the negative control (a), positive control (b), and discovery (c) pairs.d, number of rejections made by the T resid -based method (top row) and T score -based method (bottom row) on the negative control (top table)and discovery (bottom table) pairs.The various nominal FDR levels -0.1, 0.05, 0.01, and 0.005are in the columns.

Figure S13 :
Figure S13: Number of cells per perturbation vs. probability that a given pair passes pairwise QC.A plot of the number of cells required per perturbation (n trt (π, T, s)) vs. the probability that a given pair passes pairwise QC (π).The left (resp., right) plot considers setting the pairwise QC threshold T to 7 (resp., 14).Finally, the colors indicate different sparsity levels s: red = 0.78, purple = 0.30, and blue = 0.0.

Table S1 :
S1 Supplementary tables and figures referenced in the maintext Datasets analyzed in this work.The first column indicates the name of a low-MOI single-cell CRISPR screen paper; the second column indicates the datasets that we obtained from that paper; and the subsequent columns indicate the (paper-specific) biological attributes of the data, including CRISPR modality, technology platform, target type, cellular modality measured, and cell type.
*The Frangieh data also contain protein measurements, but we focus exclusively on the gene modality in this work.

Table S2 :
Statistical attributes of the datasets.The number of genes, cells, targeting gRNAs, NT gRNAs, negative control pairs, and positive control pairs for each dataset.Neg., negative; pos., positive.